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Abstract 

Background: Many QTL studies liave two common features: (1) often tliere is missing marl<er information, (2) among 
many marl<ers involved in the biological process only a few are causal. In statistics, the second issue falls under the 
headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for 
QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing 
genotypes. Qutcomes of the proposed imputation method are probabilities which serve as weights to the second 
step, namely in weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for 
the trait of interest. 

Results: Simulation studies validate the proposed methodology under a wide range of realistic settings. 
Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. 
The methodology was applied to an Arabidopsis experiment, containing 69 markers for 1 65 recombinant inbred lines 
of a F8 generation. The results confirm previously identified regions, however several new markers are also found. On 
the basis of the inferred RQC behavior these markers show good potential for being real, especially for the 
germination trait Gmax- 

Conclusions: Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to 
alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple 
regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under 
realistic missing data settings this methodology can be used for QTL identification. 

Keywords: Arabidopsis, Germination traits, QTL mapping. Recombinant inbred line (RIL), Binary genotypes. 
Likelihood-based genotype imputation, Sparse variable selection. Weighted lasso 



Background 

Quantitative traits are traits that vary continuously. The 
phenotype of a quantitative trait (QT) is the cumulative 
result of several genes, their interactions and the environ- 
ment. Regions within genomes that contain genes associ- 
ated with a particular QT are known as quantitative trait 
loci (QTL) [1]. The major biological question is to iden- 
tify the QTL associated with variation in traits. Under- 
standing the genetic architecture of quantitative traits is 
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important for animal and plant breeding, medicine, and 
evolution. For example, plant breeders can use the QTL 
identified for seed quality to select and breed plants with 
certain desirable characteristics. Molecular markers are 
specific fragments of DNA that represent the genetic 
differences between individual organisms or species [1]. 
Development of molecular (or genetic) markers creates 
new opportunities for QTL identification. Markers are 
not usually targets themselves but act as "flags" for genes 
controlling the trait. Molecular markers that are closely 
located and tightly linked to genes that control the trait 
are referred to as "tags" . 

The process of coupling the phenotype (i.e. trait 
measurements) and genotype (i.e. molecular markers) 
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data followed by QTL analysis is known as QTL map- 
ping. The aim of QTL mapping is to identify the markers 
which are tightly linked to genes affecting the trait as 
well as to estimate the magnitude of their effects. Most 
methods consider repeated single QTL models, but it 
is now understood that modeling multiple QTLs simul- 
taneously, as we consider in this paper, is superior to 
single QTL models [2]. Often both the phenotype and 
genotype data are incomplete. Though imputation meth- 
ods for phenotype data in the context of QTL mapping 
is quite well-developed [3,4], there is less consensus on 
imputation of missing genotype data, due to its categori- 
cal nature. Two major strategies for genotype imputation 
are based on: (1) a maximum likelihood and (2) multiple 
imputation strategies [5] . Although multiple imputation is 
potentially flexible, it tends to be slow for large fraction 
of missing values. Therefore, we propose a likelihood- 
based method. In the context of QTL mapping, existing 
genotype imputation methods use phenotype data and 
multiple generation information to obtain a conditional 
probability of a missing genotype [6]. These methods are 
design-specific and lack generalizability [6,7]. Most com- 
monly, the missing genotypes are replaced with predicted 
values that are based on the observed genotypes at neigh- 
boring markers, as in the multiple QTL mapping (MQM) 
algorithm [8,9]. 

Due to the roughly Markov structure of the meiosis pro- 
cess, we introduce a probability imputation method for 
markers with binary genotypes that includes information 
only from immediate neighbors. This method is applied 
to recombinant inbred line (RIL) experiment, though it 
can be extended to other mating designs with binary 
genotypes (e.g. backcross, double-haploid). Clearly, our 
method is applicable to a wider set of designs and it 
does not require the phenotype data in order to com- 
pute a probability for missing genotype. In contrast to 
others [8], the recombination rate is not estimated sepa- 
rately but rather a specific parameter is computed within 
the algorithm that plays a similar role. Our imputation 
method considers two separate models, one for markers 
at the edge of a chromosome and another for all others. 
Each model requires an estimation of a recombination 
rate parameter The model-specific parameter for mid- 
dle markers is estimated as a function of the genotypes 
of the two flanking markers and the genetic distances 
towards those neighbors. A distinguishing characteristic 
of our imputation method is that the outcomes are prob- 
abilities which correspond to weights of observing one of 
the two parental lines at that locus. We integrate these 
weights into a lasso [10] to advance the QTL identifi- 
cation. The proposed analysis pipeline is validated using 
extensive simulations and compared to alternative meth- 
ods. It is then applied to a real dataset that motivated our 
method and which is described next. 



Motivating example 

The primary biological goal of this work is QTL detec- 
tion and the eventual goal is to improve the quality of seed 
production in Arabidopsis thaliana. It has been shown 
that measurements of the germination rate of maize in 
the laboratory could predict the relative performance in 
field sowing [11]. An increase in sowing performance can 
result in economically important crops. A similar strategy 
is taken in our study where germination characteristics 
of Arabidopsis seeds were examined in order to find 
QTLs associated with each trait [12]. Lines from recom- 
binant inbred population are important and convenient 
resources to study the genetic mapping of quantitative 
traits in plants or animals. 

All RILs have the same parents. Each RIL has a unique 
combination of loci derived by recombination of the alle- 
les present in the parents. Thus, each RIL has a unique 
genetic make-up. One traditional way of RIL construction 
is to cross two parental plants to produce an Fl genera- 
tion, followed by several consecutive generations of self- 
mating. This results in a so called "core population". These 
lines are practically homozygous and can be propagated 
indefinitely as clones. Biological and technical details of 
the RIL procedure are shown in Figure 1. 

In our study, F8 seeds from 165 lines of Arabidop- 
sis were obtained from the Versailles Biological Resource 
Centre [13]. The seeds were the results of cross between 
Bayreuth and Shahdara Arabidopsis plants, using an 
inbreeding approach over eight generations. Bay-0 orig- 
inates from a fallow-land habitat near Bayreuth in 
Germany, whereas Shahdara grows at high altitude in 
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Figure 1 Arabidopsis RIL procedure. Left-RIL procedure with 
self-mating over 8 generations; right-Arabidopsis plant (source: 
http://www-ijpb.versailles.inra.fr/), namely left, the vegetative stage, 
before flowering and growth of the floral stalk (bottom left). On the 
center an adult plant at full flowering/seed set. On the rigth, flower, 
floral stem and seeds. White bars represent 1 cm, except for flower 
and seeds: 1 mm. 
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the Pamiro-Alay mountains in Tadjikistan [14]. The Bay- 
0 and Sha RIL populations have been used in several 
previous studies to map QTLs [12]. Arabidopsis has five 
chromosomes. For every RIL, 69 markers were genotyped 
with an average genetic distance of 6.1 centimorgans 
(cM) between markers [14]. Of the 69 included markers, 
respectively 18, 11, 12, 11 and 17 markers are located on 
chromosomes 1, 2, 3, 4 and 5. The lengths of chromo- 
somes are 91.3, 64.6, 72.2, 69.1 and 91.2 cM respectively 

Arabidopsis germination experiment 

The phenotyping experiment was conducted in two 
stages: (1) seed sowing followed by measurements of 
grown plant traits and (2) collecting seeds from these 
plants followed by germination [12]. In this study we 
examine traits from the second part of the experiment, 
namely germination. In the first stage of the experiment 
(2008), Arabidopsis seeds obtained from Versailles were 
randomly allocated to three plates and grown in a cli- 
matized chamber. The plates can be considered technical 
replicates, each with 3-5 RIL plants. One year later, seeds 
stored in 2008, were planted on a fourth plate. In addition, 
the best seeds (free of fungus, etc.) from the first three 
plates were collected and germinated in 2009 on a fifth 
plate. In the second stage of the experiment, 50-100 seeds 
from each line of the core population of grown plants were 
collected and germinated. 

Several factors were varied or simply needed to be 
accounted for in germination experiment, such as seed 
age, dormancy, imbibition, growing plate (and selec- 
tion), temperature, and chemical stress. With respect 
to dormancy, i.e. storage conditions, two types can be 
identified, namely fresh and after-ripened (AR) [12]. 
Besides normal temperature, two types of tempera- 
ture stresses were applied, namely cold and heat shock. 
The following chemical stresses were applied: table salt, 
osmotic-inducing mannitol, oxidizer hydrogen peroxide, 
inhibitor abscisic acid (ABA), and controlled deteriora- 
tion (CD) [12]. Germination process under all chemicals 
except hydrogen peroxide was carried out in light and 
dark imbibition conditions. 

Cumulative germination data were gathered to estimate 
the germination performance. Five relevant parameters 
from the germination-time curve were extracted using 
the Germinator package [13]. These parameters are (1) 
the percentage of maximum germination, Gmax. indicat- 
ing the maximum germination capacity of a seed lot, (2) 
the time to reach 10% of germination, Tio, indicating 
initiation of germination, (3) the time to reach 50% of ger- 
mination, T50, indicating rate of germination, (4) the time 
between 16% and 84% of germination, U8416, indicating 
uniformity of germination, and (5) the area under the ger- 
mination curve between t=0 and t=100 hours, AUCioo 
(see Figure 2). 
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Figure 2 Cumulative germination curve, described by G^ax (%), 
T50 (h), Tio (h), U8416 (h), and AUCioo (%x h). 



Statistical design 

All five traits are continuous traits (Gmax (%)> T50 (h), Tio 
(h), U8416 (h), AUCioo (%x h)). Higher Gmax means more 
germination. Lower Tio and T50 mean faster germina- 
tion. The aim of this QTL analysis is to find the markers 
that are associated with these five response variables. We 
use the terms "(quantitive) trait" and "response variable" 
interchangeably. 

The set of explanatory variables are the 69 markers. Our 
genetic dataset contains genotypes of 69 markers for 167 
plants (including parental Bay and Sha). Marker geno- 
types for Bay and Sha are denoted by 0 and 1 respectively. 
Genotypes across all markers are the same for each par- 
ent. Genotypes in children plants are inherited from either 
of the parents, therefore each marker has only two pos- 
sible genotypes {0, 1}. So, technically, the 69 markers can 
be seen as a distinct combination of O's and I's across the 
RILs. 

All 167 plants are treated under 42 different conditions 
resulting in 7014 observations for every trait, some of 
which are missing. These conditions are made up from 
combination of the factors (described in previous section), 
which are not of primary interest, but which must be 
taken into account. Conditions are age, dormancy (Fresh, 
AR), plate (1-5), imbibition (light, dark), temperature 
(8, 10, 20, 25, 30 degrees Celcius) and chemical stress (no, 
salt, mannitol, hydrogen peroxide, ABA, CD). Hence, each 
response variable is adjusted for these nuisance variables. 
Such adjustment confirms that any detected marker effect 
is robust under all these conditions. 

Missing data 

The phenotypes Gmax. T50, Tio, U8416, and AUCioo con- 
tain respectively 0.49%, 1.90%, 1.90%, 1.92% and 0.49% 
missing data. It seems reasonable to assume that the miss- 
ingness of any observation for a given trait is independent 
of the observed and unobserved values. Such missing 
mechanism is known as missing completely at random 
(MCAR). Furthermore, the small percentage of missing- 
ness across all phenotypes means that we can safely omit 
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the missing observations from our analysis, even if the 
MCAR assumption is not true. 

We summarized the number of missing markers per 
plant and the number of missing plants per marker. About 
25% of RILs do not have missing values for any marker. 
The remaining RILs have up to 9 missing values, with only 
two RILs have 20 and 24 missing markers. As for the num- 
ber of missing plants per marker, we counted that each 
marker has at least one missing RIL. The number of miss- 
ing RILs per marker vary between 1 and 13. Missingness 
in markers may be caused by essay quality, poor hybridiza- 
tion and/or other reasons. Such nature of missingness 
is unlikely to be MCAR, as the missingness may well 
statistically depend on whether its neighbor is missing - 
due to the sequential operation of the genotyping instru- 
ment. This missingness feature might then be described 
as missing at random (MAR) since the missingness does 
not depend on the unobserved markers themselves, but 
it does depend on the observed markers, i.e. we observe 
whether its neighbor is missing or not and this enables 
prediction of the probability that this marker is missing. 

Methods 

Marker probability model 

There are only two possible genotypes for each marker in 
a RIL experiment. Let xl!j be the parental type for a RIL i 
at chromosome c at genetic location t. Namely: 



(,) I 0, if parental type at (c, t) for RIL / is Bay 
1 1, if parental type at (c, t) for RIL i is Sha. 



(1) 



We assume that for each marker the genotype at position t 
depends on the genotypes of two immediate neighboring 
markers at positions to and ti, as well as the distances (ti — 
t) and (t—to) assuming to < t < ti.lt would be reasonable 
to assume an Ising model on switching the genotypes from 
one marker to another: 



A') 



s<toUs>ti 



where to and ti are genetic locations of flanking markers 
and xl-'l'fg and ^v^'^^ are genotypes of those markers. In our 
case genetic locations to, t and ty refer to genetic distances 
from starting point of a chromosome. 

As stated above, markers in a RIL have only two geno- 
types {0, 1}. There are two possible sources of the genetic 
variability, genetic recombination and mutation. Recom- 
bination/meiosis is a process of chromosomal crossover 
whereby two chromatids can mesh with one another. The 
variations, isolated by breeders are the result of recom- 
bination and not mutation due to short period of time 
involved with the isolation of the varieties. Variation due 



to mutation on the time-scale of this experiment (i.e. 6/8 
generations) is dwarfed by the variation of recombination. 

This physical intertwining during meiosis induces nat- 
urally (though not necessarily) a Markov dependence 
structure, whereby knowledge of the configuration of 
chromatids at a particular marker depends only on the 
neighboring configurations. With respect to the shape 
of the dependence structure, we note that it is natural 
that the absolute value of the derivative of the probabil- 
ity model is minimal exactly half-way through the interval 
between two known markers. This reflects the fact that 
the amount of change of information is smallest when we 
are further away from the known markers. This leads to 
four possible scenarios of a marker with immediate neigh- 
bors. Figure 3 shows two of such scenarios, the remaining 
two are symmetric (about the x-axis) of the ones depicted 
on this Figure. These scenarios are meaningful only for 
markers having both neighbors. In turn, a marker has 
both neighbors if it is not at the edge of a chromosome. 
As for edge markers, only two scenarios are possible (not 
depicted here). Considering scenarios given in Figure 3, 
we propose a model that exhibits the same shape. We 
introduce a parameter a e [ 0; oo) which technically has a 
scaling function and biologically it has a role similar to the 
recombination rate. The probability model for a RIL i and 
a marker located in the middle of a chromosome is: 



71 (a) = P \x\'t = l\xl}^ = xo,xlt^ =Xlj 

1 1 / ti - t \loi(ti-to)+l] 

= 2 + 2^"''V^r^j 
2 ""Hi, - in/ 



(3) 



la(ti-tQ)+l] 



scenario 1 



J3 

o 




X3 
CO 

O 




location 



location 



Figure 3 Possible scenarios for genotypes of tliree consecutive 
markers located at locations to, t (at mark ?) and ti for RIL /: 
(scenario 1 ) y axis shows the probability of observing Sha at 
location t given we observe Sha at locations to and , 
P(xc,t = 1 \xc,to = 1/^c,ti =1); (scenario 2) y axis shows the 
probability of observing Sha at location t given we observe Bay 
at fo and Sha at ti,P(xc,t = 1 \xc,to = 0,Xc,t^ = 1). 
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where xq, xi e {0, 1} and 5^ is defined by: 



Note, for a marker surrounded by two known markers, if 
a = 0 this means the recombination rate is zero. There- 
fore, (3) gives us that the probabiUty of a Sha marker 
between two given Sha markers is 1. In contrast, if a — >■ oo 
then the recombination rate is infinite meaning that there 
is no information in neighboring markers. Therefore, (3) 
gives us a probability of 0.5 for any value of the flanking 
markers. 

Likewise, for markers at right edge of a chromosome the 
probability model involves a parameter e[ 0; oo): 

7T(p) ^ P (4] = = Xo) = ^ + \s,,p('-''>\ (5) 

where xq e {0, 1} and Sx is defined as in (4). The probabil- 
ity model for the markers at left edge of a chromosome is 
similar to equation (5). 

Pseudo maximum likelihood In imputation 

To estimate the parameters a and /! in probability mod- 
els (3) and (5) we use only the children plants and not the 
parents. The pseudo log-likelihood [15,16] of a model (3) 
for middle markers is: 

eia) ^ S,-,,,flnP(4]|43,_,,,x® 

= Ei,,,f[x®ln(7r(a)) (6) 

+ (l - 4;]) ln(l - 7T(a))] , 

from which we estimate a. In a similar way we estimate fi. 

We examined whether separate parameters are required 
for every chromosome. For middle markers we tested Hq: 
ai = a2 = — = as versus Hi: at least one a differs 
from others. Subsequently, we computed the pseudo log- 
likelihood on five parameters and compared it with the 
pseudo log-likelihood on one parameter using the pseudo 
likelihood ratio test (LRT) statistic: 

LET = -2[l{aoneChr) - UoiFiveChr)] ■ (7) 

The distribution of the LRT is the weighted sum of four 

independent xf distributions [17]. Since the pseudo like- 
lihood is probabilistically close to the true likelihood, the 
LRT can also be approximated by x^. Alternatively, the 
p-value can be calculated using a bootstrap approach. 

We also tested the goodness of fit of the proposed model 
using Pearson's chi-squared statistic. 

Genotype imputation 

The missing markers are substituted with a one or a zero, 
by rounding the probability J'(^c!f l"*^c!fo'*^c!ti) integer. 



The substituted value is more certain when the proba- 
bility is closer to zero or one. It is less certain when the 
probability is close to 0.5. In the analysis we will use 
the weights that would represent this uncertainty. The 
weights for the missing genotypes are: 

H.«=2|p(x«|x®x®)-^|, (8) 

where w^'^ e[ 0, 1]. Indeed, zero weights are used for geno- 
types with rounded probabilities equal to 0.5 and weights 
of ones are used when the imputed probabilities approach 
to zero or one. For non-missing genotypes, we assume that 
they are observed with complete certainty resulting in a 
weight of one. The highest possible weights are given to 
the observed genotypes. The weights w'^j. are computed 
from the imputed (predicted) probabilities P(x^^]). These 
probabilities were estimated via the maximum likelihood 
estimation procedure given in previous two sections. 

Phenotype response 

We adjusted every trait for the effect of the nuisance 
variables (age, dormancy, plate, imbibition, temperature, 
chemical, Bay, Sha). For observation i, the adjustment was 
carried out using a linear regression as shown on the 
example of Gmax below: 

Gmaxi = <po+ (piagei + ... ^m^hai + yu (9) 

where the residual variance is constant across all obser- 
vations and the residual is distributed independently and 

identically (iid), yi ~ A/^(0, t^). The model (9) employs 
m regression parameters (po, ■ ■ ■ , (pm- Then we computed 
the difference between the observed and fitted values for 
every observation and used these residuals as inputs to 
weighted lasso. 

Weighted lasso phenotype inference 

The original lasso weights all observations equally [10]. 
An adaptive lasso is an extension of lasso by weighting 
or penalizing different coefficient differently in a way that 
depends on the data [18]. The proposed weighted lasso 
(wlasso) is a different extension of the lasso by weighting 
different observations differently in a way that depends 
on the data and the value of the coefficients. An adaptive 
lasso places weights in the penalty part of the objective 
function, whereas wlasso places weights in the sum of 
squares part of the objective function. 
For convenience, we use in this section the notation Xij 

and Wij instead of x''^^^^. and vf^^'^^ for the genotype and 
weight information of RIL / at chromosome c/ at loca- 
tion tj, since information about chromosomal locations 
and genetic distances have already been incorporated into 
the estimates of the probabilities jr(a) or n(_P) and the 
corresponding weights. Below we describe the wlasso 
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algorithm. Let X be the matrix that contains original geno- 
type values {0, 1} and the rounded imputed probabilities; 
Xij is an element of this matrix for RIL i and marker 
j. yi is the residual response for RIL i, as described in 
previous section. We assume that the observations are 
independent. We define the wlasso estimate 6 as: 



6x = argmin 



!=i ^;=i 



=1 ni 




(10) 



The idea behind the method is to downweight obser- 
vations with a lot of imputed values Xij: note for instance 
that observations with all weights w,y zero are eliminated 
from the regression, whereas observations with a fully 
observed Xi,, and therefore all weights Wij equal to one, 
are fully taken into consideration. Moreover, observations 
that have only missing values on marker locations j that 
are deemed irrelevant for the regression, i.e. 9j = 0, will 
not be penalized for their partial missingness. The model 
naturally accounts for imputation imprecision, without 
letting irrelevant imputations affects the quality of the 
estimate and with the ordinary lasso as limiting case when 
no imputation was performed. 

Just as in ordinary lasso, the selection of "best" reg- 
ularization parameter X is not obvious. Various selec- 
tion methods, such as the Bayesian information criterion 
(BIC), Akaike information criterion (AIC) and cross- 
validation, have been proposed. As suggested [19], the BIC 
is the most relevant criterion when the sparsity of the 
model is of primary concern. The BIC tends to lead to con- 
sistent selection of X and quite sparse models with relevant 
biological interpretation. Therefore it is used in this study, 
i.e. we minimize the following objective function across X: 



BICiX) = J2 



{yi - Ef=i ^Kjxij) 



■df(X)ln(n), (11) 



where is some robust estimate of the variance that does 
not depend on X and df{X) = X!f=i -5^0) number 
of non-zero parameters in the model. 

Given that (10) cannot be minimized explicitly, we use 
an iterative procedure to obtain Ox, which given initial 
non-zero weights is guaranteed to converge to the global 

"(0) 

minimum. In practice, we define initial estimates 0 = 

(6[^\ 02^\ . . . , dp^^) for all markers using the regular lasso. 
Such initialization is the same as assigning weights of ones 
wf^^ = 1 to every RIL i. In iteration k + 1, the regression 
parameters are updated as: 



: argmin ^ wf^ yi 



7=1 



(12) 



+xY,\dj\ 



where the weights wf^ for each RIL in equation (12) are 
updated in an iterative way as: 



(k) 



(13) 



We are defining convergence as the first k, such that 
l^r^^^ — wf^\^ < €, where e is a predefined tolerance 
level. Using tolerance level e = 10~^ all five traits con- 
verged in 4 iterations. The plots of weights are included 
for visualization convenience (see Additional files 1, 2 
and 3). 

Results 

In this section, we first present the results of detecting 
QTL effects for germination in Arabidopsis. Then, we 
present the strategies and results from three simulation 
studies. With the first simulation study we aim to jus- 
tify our proposed methodology. In the second and third 
simulation studies we compare our methodology with 
alternative methodologies for the case study. The second 
study emphasizes a comparison of our imputation meth- 
ods with the nearest marker imputation. The third study 
focuses on comparison of sparse variable selection tech- 
niques, namely our weighted lasso, the traditional lasso 
[10] and adaptive lasso [18]. 

Analysis of Arabidopsis germination experiment 

For the genotype data, we estimated two recombination 
rate parameters, depending on whether the marker was 
on the edge or in the interior of a chromosome. The 
parameters a and p in probability models (3) and (5) 
were estimated using 165 children plants. The maximum 
pseudo log-likelihood estimates for these parameters 
were a = 0.0047 and /3 = 0.9524 (see Figures 4 and 5). 
The need for introducing a separate parameter for each 
chromosome is shown not to be necessary for interior 
markers (LRT=4.633, df=4, chi-square p-value=0.327, 
bootstrap p-value=0.456) and for edge markers 
(LRT=4.116, df=4, chi-square p-value=0.391). The good- 
ness of fit of the proposed recombination models for 
interior and edge markers was tested using Pearson's 
chi-squared statistic. The results suggest a good fit of 
both models (for interior marker: x^=7458.23, df=9280, 
p-value=l; for edge markers: x^=1210.51, df=1591, 
p-value=l), suggesting that we do not need more than just 
the flanking markers to infer the genotype of the missing 
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Figure 5 Plot of log-likelihood for edge markers with maximum 
at 0 = 0.9524, 1 = -594.24). 



marker. This is in agreement with the traditional meiosis 
model of recombination. 

We adjusted all 7014 observations of every trait using 
regression model (9). Number of regression parameters 
<po, ■ ■ ■ , (pm IS high since some of the input variables are cat- 
egorical. In total we estimated m = 15 parameters. Then, 
the residuals were modeled using wlasso, as described in 
previous sections. The marker effects, demonstrated by 
regression coefficients in wlasso, are presented in Figure 6. 
These values will be used in the simulation studies below. 
The BICs for all five traits are shown in Figure 7. For 
each trait, those markers are selected by wlasso as indi- 
cated by the minimal BIC value. Thus, on the basis 
of BIC 29, 10, 15, 11 and 22 markers are selected for 
Gmax. T50, Tio, U8416. and AUCioo respectively Clearly, 
the number and selection of markers differ for each 
trait. They represent in total 39 out of the 69 available 
markers. 

We computed the LRT under the full and a restricted 
models on a 10-base logarithm scale for markers selected 
by wlasso. We present the 39 marker names, genetic dis- 
tances and respective LRT statistic in Table 1. The larger 
the LRT statistic, the stronger the evidence in favor of a 
QTL effect at a particular location of the chromosome. 
The LRT statistic below 0.01 was substituted by 0. We 
would like to emphasize that detecting peaks that contain 
several markers increased our confidence in the region 



0 

% 




0 

0 

0 




0 

0 0 
00 " * 












0 




0 0 




00 








fori 
-0.5 


0 0 


0. \ 






























°o » . 










0° ° S° 




















% 




0 




0 



0 20 40 60 0 20 40 60 0 20 40 60 

wlasso Steps wlasso steps wlasso steps 



0 




0 


0 






0 

0 


JC100 
1 1 


0 
0 


0 

" ° 0 


< 


0 

0 




^jO 


0 




0 


0 20 40 60 




0 20 40 60 



wlasso steps wlasso steps 



Figure 6 Plots of weighted lasso regression coefficients 0 vs. 
number of steps in weighted lasso for traits Gmax/Tso, Tio, U8416, 
and AUCioo- 
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being associated with a trait. For example, the beginning 
of chromosome 3 (with markers MSAT399, ATHCHIB2, 
MSAT305754) and middle of chromosome 4 (MSAT415, 
CIW7, MSAT418) give strong evidences of these regions 
to be associated with Gmax (we see a set of genes with 
a high LRT statistic). Loci at chromosomes 3 and 5 are 
strongly indicative for all five traits. We also visualized the 
LRT statistic across genetic distance and presented the 
plot for T50 in Figure 8. Five peaks with eight markers 
are seen for T50 peaki-MSAT399, ATHCHIB2; peak2- 
MSAT332; peak3-MSAT49, MSAT468; peak4-MSAT514, 
NGA139; peak5-MSAT520037. We have the highest con- 
fidence in peaks since it contains multiple markers having 
the large LRT statistic. All peaks except the one at chro- 
mosome 4 have been detected by biologists as well [12]. 
Thus, we have found one additional peak for T50. 

None of the detected peaks at chromosomes 2 and 4 
were identified before [12]. Peaks at chromosome 2 are 
relatively low. The region with MSAT25 provides rel- 
atively high confidence for association with Tiq. The 
duo markers (MSAT238 and MSAT241, IND216199 and 
MSAT210, MSAT210 and MSAT222), with a considerable 
LRT value, demonstrate a certain confidence of associa- 
tion with Gmax. Tio and U84i6 respectively. Thus, regions 
at chromosome 2 should be considered among QTL 
despite their low LRT statistic. Similar interpretations 
apply to detected regions at chromosome 4. In addition, 
several peaks detected by us at chromosomes 1, 3 and 5 
have not been identified by others [12]. 



For comparison with LRT given above, we computed 
the LOD-score for every marker across each trait. The 
LOD-score is essentially the LRT statistic on a 10-base 
logarithm scale. It measures the strength of evidence for 
the presence of a QTL at a particular location. The LOD- 
score for each marker is computed as the LRT under the 
full and a restricted models. The full model includes all 
69 markers, while the restricted model includes all but the 
marker of interest. We present the LOD-scores for T50 in 
Figure 9 and it is clear that some of the peaks from the 
LRT are much sparser and that several large peaks have 
disappeared. This shows the instability of the ordinary 
multiple regression approach as compared to the stable 
lasso method. 

Simulation strategies 

Simulating genotype data by an Ising model (interac- 
tion parameter ;?) is quite realistic for a recombination 
process on a chromosome [20]. The parameter shows 
the strength of the dependence between markers. We 
considered that inheritance at loci on different chromo- 
somes are independent events and simulated the markers 
of every chromosome one at a time. We included the 
dependence between markers in two ways: (1) the depen- 
dence between equally-spaced neighboring markers using 
an Ising model with i] = 0.4, (2) the genetic distances 
between the observed markers by subsampling the full 
process. In particular, we rounded the genetic distance of 
every chromosome and simulated markers with genotypes 
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Table 1 LRT statistic of marlcers selected by weighted lasso 
for Gmax» Tso, Tio, U8416/ and AUCioo ordered by genetic 
distance across 5 chromosomes in Arabidopsis 



Marker 


chr 


gdist 


Gmax 


Tso 


Tio 


^841 6 


AUCioo 


MSAT1 00008 


1 


0 








0.64 




F21M12 


1 


9.7 


0,09 










IND4992 


1 


154 




0.04 




0.32 




MSAT110 


1 


21.6 


0,02 








0 


MSAT108193 


1 


26.6 


0.14 








0.21 


T27K12 


1 


49.1 


1.27 








0.65 


F5I14 


1 


69.6 


0 








0 


MSAT1 27088 


1 


82.7 


0.15 








0.33 


MSAT15 


1 


91.3 


0.01 






0.81 


0 


MSAT25 


2 








0.19 






MSAT200897 


2 


7.9 




0.02 








MSAT238 


2 


13 


0,16 








0.15 


MSAT241 


2 


35 


0,09 




0 






IND216199 


2 


51.5 






0.03 






MSAT210 


2 


57.9 






0.17 


0.24 




MSAT222 


2 


64.6 








0.86 




MSAT399 


3 


3.2 


0,68 


0.79 




0.58 


1.03 


ATHCHIB2 


3 


6.6 


0.57 


0.02 


0 




0.16 


MSAT305754 


3 


7.9 


0.58 










MSAT319 


3 


23.2 


0.03 








0.13 


MSAT332 


3 


39.5 


0,17 


0.38 


0,51 


0.03 


0.31 


MSAT321 


3 


48 


0,33 






0.09 


0.25 


MSAT3 18406 


3 


53.3 


0,05 








0 


MSAT318 


3 


64.1 






0 






MSAT370 


3 


72.2 


1,28 










MSAT48 


4 


2 


0,09 




0,04 






MSAT443 


4 


10.7 










0.52 


MSAT41 5 


4 


33.5 


045 










CIW7 


4 


45 


1.22 








029 


MSAT41 8 


4 


47 


0,03 










MSAT49 


4 


55.6 


0,06 


0.60 


0,31 




0 


MSAT468 


4 


61.8 




0.33 


0,17 






MSAT500027 


5 


0 


0,34 








0.5 


MSAT514 


5 


26.6 


0 


0.16 


0,01 


046 


0.09 


NGA139 


5 


304 


0,42 


0.03 


0 




0.13 


MSAT520037 


5 


674 


0.9 


0.30 


0.27 


0.01 


1.07 


MSAT512 


5 


71.6 


0.13 




0,39 




0.31 


MSAT519 


5 


85 


026 






0.01 




K9I9 


5 


91.2 


0,35 




0,08 




025 



{0, 1} 1 cM apart for 165 RILs. Then we selected those 
markers which were spaced with the same genetic dis- 
tances as markers in our RIL experimental data. Markers 
for every chromosome were simulated independently and 
then joined together as a genotype dataset. 

We assumed that among all observed markers about 
10% have the true QTL effect. Thus, the largest posi- 
tive and negative 6* of 6 markers from weighted lasso 
of our real experiments were selected as the true effects 
(see Figure 6). Among the simulated 69 markers, 6 evenly 
spaced markers (along five chromosomes) were selected 
as the true input variables. An additive effect of mark- 
ers was assumed and the response variable was generated 
using the multiple regression model. Residual error, hav- 
ing the normal distribution with mean = 0, was added 
to the trait. We studied our method under several val- 
ues of residual error variances, namely cr^ = 0.5, 1, 2, 3. 
We also investigated our methodology with 6 true mark- 
ers being clustered (3 markers on the first chromosome 
and 3 on the second one) and compared it with the case of 
evenly-spaced markers. 

We studied two missing mechanisms among mark- 
ers: (1) an MCAR using Bernoulli missingness and (2) 
MAR using an Ising missingness model. Following our 
experimental data, we explored the case with 10% of 
missingness. Thus, the probability parameter in Bernoulli 
distribution is 0.1. We assumed the stronger dependence 
f]MAR = 0.6 among missing markers than simply among 
observed and non-observed markers {ri = 0.4). For 
every above described scenario, we carried out 50 simu- 
lations. The simulated data were analyzed using the pro- 
posed imputation model and wlasso as well as alternative 
approaches. For every simulation scenario, we summa- 
rized the performance of the tested methodology using 
the receiver operating curve (ROC). For that we measured 
the fraction of true positives out of all positives, so called 
true positive rate (TPR) and the fraction of false positives 
out of the negatives, so called false positive rate (FPR). 
To be specific, TPR=TP/(TP+FN) and FPR=FP/(FP+TN), 
where TP, TN, FP, and FN are the numbers of true pos- 
itives, true negatives, false positives, and false negatives. 
TPR and FPR are also known as the sensitivity and (1- 
specificity) respectively. 

For the sparse variable selection methods, we studied 
the TPR and FPR across a range of the lasso regularization 
parameter X (BIC is not employed here as it was for real 
case study). 

Simulation study 1: justification of the proposed 
methodology 

The ROC curves for equally-spaced markers with MCAR 
and MAR are demonstrated in Figures 10 and 11. We 
see, that as the residual error variance increases, both 
the TPR and FPR decrease. This trend was very similar 
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for clustered markers with both missing mechanisms. 
Though, as the residual error variance increases {a^ > 1), 
the TPR and FPR drop more for clustered markers than 
for equally-spaced markers in MAR (see Figure 12). 

To evaluate the accuracy of RIL experiments, we have 
to examine the results from real data in light of the sim- 
ulation results. Traits Gmax and T50 were examined as 
examples. In the simulation studies: regression coeffi- 
cients were 9 = 0.5 (and up), investigated residual error 
variances were = 0.5, 1, 2, 3, and the number of RILs 
was 165. In experiment for T50: regression coefficients 
were 9 = 0.5 (and up), residual error variance was = 
150, and the number of observations was n = 165 x 42 = 
7014. These are equivalent to simulations with 165 RILs 
and (7^ « 3.5. In experiment for Gmax: 9 = 0.02 (and 
up), CT^ = 0.04, n = 7014. These are equivalent to sim- 
ulations with 9 = 0.5 (and up), 165 RILs, and ct^ = 
0.04(0.5/0.02)2/42 0.6. From these residual error vari- 
ances, we can find the corresponding ROC curves for T50 
and Gmax (see Figure 10 or 11). Thus, the results of T50 
experiment are less powerful given the overall low ROC 
curve (ct^ = 3). In contrast, the results of Gmax are highly 
stable, given the overall high ROC curve (ct^ = 0.5). 

Simulation study 2: comparison of tlie proposed 
metliodology focusing on imputation metliods 

The nearest marker imputation and multiple regres- 
sion are frequently employed methods for QTL analysis. 
Our imputation was compared with the nearest marker 



imputation and wlasso was compared with the multiple 
regression. As a result we examined four models: (1) our 
imputation and wlasso, (2) our imputation and multi- 
ple regression, (3) nearest marker imputation and wlasso, 
(4) nearest marker imputation and multiple regression. 
Whenever wlasso was included in a simulation we studied 
the TPR and FPR across a range of the tuning parameter 
A. Whenever multiple regression was included in a sim- 
ulation we used the full significance level range [0, 1] by 
increments of 0.015. Such increment results in 67 steps 
which are of the same order as number of steps in the 
wlasso. All four models were applied to simulated data 
described above and were studied for equally-spaced and 
clustered markers with MCAR and MAR mechanisms. 
The results were summarized using ROC curves. The 
ROC curves across four models for clustered markers 
when the residual error variance is small (ct^ = 0.5) 
are shown in Figures 13 and 14. Our model 1 outper- 
forms others under all scenarios. We also show similar 
plots for larger residual error variance (a^ = 3) when 
the markers are evenly-spaced (see Figures 15 and 16). 
Clearly, as the residual error variance increases, our Model 
1 has more pronounced sensitivity and (1 -specificity) than 
other models have. We also demonstrate the results for 
intermediate variance (a^ = 1) when markers are evenly- 
spaced and clustered (see Figures 17 and 18). Interestingly, 
for smaller residual error variance 

(0-2 < 1), the ROC 
curves of Model 2 are slightly above the curves of Model 
3. This implies that the probabilistic imputation method 
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Figure 1 7 ROC curves comparison across 4 models for evenly-spaced markers with IVIAR mechanism when = 1 . 



with multiple regression has slightly higher accuracy than 
nearest marker imputation with wlasso. For larger resid- 
ual error variance (cr^ > 1), the ROC curves of models 
2 and 3 are approximately on top of each other, implying 
that the improvement for both wlasso and the probabilis- 
tic imputation method is roughly the same. However, just 
comparing our imputation method with nearest marker 
imputation for wlasso (Model 1 vs. Model 3) and for 
multiple regression (Model 2 vs. Model 4) demonstrates 
that our imputation method outperforms nearest marker 
imputation. 

Simulation study 3: comparison of the proposed 
methodology focusing on sparse variable selection 
techniques used for phenotype inference 

Our wlasso is compared with the classical lasso and 
an adaptive lasso with three weighting schemes [18]. In 
an adaptive lasso, we estimated the weight vector as 
w adaptive = l/I^OisT. where 0OLS is a vector of ordinary 
least square estimates and y = 0.5, 1, 2 [18]. All five mod- 
els were applied to the simulated data described above. 
For lasso and adaptive lasso, the imputed probabilities 
were rounded towards zeros and ones after the imputa- 
tion (ignoring our weighting procedure). Thus, an input 
matrix to lasso and adaptive lasso contained genotype val- 
ues {0, 1}. We investigated settings for equally-spaced and 
clustered markers with both, MCAR and MAR mecha- 
nisms. Again, the results were summarized using ROC 
curves. The ROC curves of the five models for both. 



evenly-spaced and clustered markers with MCAR mecha- 
nism when the residual error variance cr^ = 1 are shown 
in Figures 19 and 20. Similar plots for MAR are presented 
in Figures 21 and 22. Clearly, the wlasso is more accurate 
than other four alternatives. This accuracy is maintained 
across the investigated variances of all magnitudes (cr^ = 
0.5, 1, 2, 3), see Additional files 4, 5, 6, 7, 8, 9, 10, 11, 
12, 13, 14, and 15. An obvious advantage of wlasso is 
observed for both, evenly-spaced and clustered markers 
with MAR mechanism (see Additional files 6, 7, 10, 11, 
14, and 15). For clustered markers with MCAR, the wlasso 
has lost an obvious advantage but still remains at the same 
accuracy level as lasso when the residual error variance 
(cr2 = 2,3) increases (see Additional files 9 and 13). 
Though, for evenly-spaced markers with MCAR and large 
residual error variances (ct^ = 2, 3), the wlasso main- 
tains noticeably higher accuracy levels than the other four 
approaches (see Additional files 8 and 12). 

Discussion 

The simulation studies have shown that the combina- 
tion of the proposed probabilistic imputation method 
and wlasso is an accurate methodology for QTL analy- 
sis. The pipeline suggested in this paper has an advantage 
of computational speed. An alternative to the proposed 
likelihood-based imputation is multiple imputation [21], 
but it is slower and leads every time to a possibly differ- 
ent result. The wlasso is used to advance the selection of 
markers associated with a trait. 
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In this paper, we analyzed each of the five traits of the 
Arabidopsis separately. In principle, it is possible to anal- 
yse the traits jointly, as they are all traits associated with 
germination. Clearly, the QTLs shared by all traits can be 
analyzed further. To identify whether these loci are causal 
or reactive for a particular trait is an interesting follow- 
up question. Possible causal relationships among a trio 
of two traits and a QTL is summarized by others [22]. 
Their approach can be applied to various pairs of selected 
Arabidopsis traits and extended to a quintet of traits in 
order to determine the type of relationship (for exam- 
ple, independent, reactive, causal) existing among traits. 
Though, this goes beyond the scope of the paper. 

Conclusions 

Our methodology has high accuracy in terms of sen- 
sitivity and specificity for clustered and evenly-spaced 
markers for both, MCAR and MAR missing mecha- 
nisms. Clearly, the accuracy increases as the magnitude 
of the residual error variance decreases. In comparison 
with other approaches, our proposed methodology out- 
performs alternative methods under most investigated 
scenarios but is never worse than any of the approaches. 
More specifically, our probabilistic imputation method 
is more accurate than the nearest marker imputation. 
Also, our wlasso is more accurate than commonly prac- 
ticed multiple regression, the traditional lasso, and adap- 
tive lasso (with the three selected weighting scheme). 
More importantly, our methodology has been biologically 



validated on an Arabidopsis study and demonstrated good 
accuracy. In conclusion, the proposed methodology can 
be used for QTL identification, especially under the real- 
istic setting of missing genotypes among markers. 

Additional file 



Additional file 1 : Initial and updated weights after 1 , 2 and 4 
iterations for Gmax- 

Additional file 2: Initial and updated weights after 1,2 and 4 
iterations forTso. 

Additional file 3: Initial and updated weights after 1,2 and 4 
iterations forTio. 

Additional file 4: ROC curves comparison across 5 models for 
evenly-spaced markers with MCAR mechanism when = 0.5. 

Additional file 5: ROC curves comparison across 5 models for 
clustered markers with MCAR mechanism when = 0.5. 

Additional file 6: ROC curves comparison across 5 models for 
evenly-spaced markers with MAR mechanism when = 0.5. 

Additional file 7: ROC curves comparison across 5 models for 
clustered markers with MAR mechanism when = 0.5. 

Additional file 8: ROC curves comparison across 5 models for 
evenly-spaced markers with MCAR mechanism when =2. 

Additional file 9: ROC curves comparison across 5 models for 
clustered markers with MCAR mechanism when = 2. 

Additional file 1 0: ROC curves comparison across 5 models for 
evenly-spaced markers with MAR mechanism when = 2. 

Additional file 11 : ROC curves comparison across 5 models for 
clustered markers with MAR mechanism when = 2 

Additional file 1 2: ROC curve comparison across 5 models for 
evenly-spaced markers with MCAR mechanism when ct^ = 3. 
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Additional file 1 3: ROC curves comparison across 5 models for 
clustered markers with IVICAR mechanism when = 3 

Additional file 14: ROC curves comparison across 5 models for 
evenly-spaced markers with MAR mechanism when ct^ = 3. 

Additional file 1 5: ROC curves comparison across 5 models for 
clustered markers with IVIAR mechanism when = 3. 
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